Geographical analysis of media flows
A multidimensional approach
Introduction
1 Corpus preparation
The aim of this section is to prepare a corpus of news related to a language and one or several countries over a period of time. As an example, we will try to prepare a corpus of news in french (fr) related to France (FRA), Belgium (BEL) and Algeria (DZA) over a period of 2 years from 1st Jan 2014 to 31th December 2015. As news include not only titles but also descriptions, we decide to break the descriptions in sentences and keep a maximum of 2 sentences by news, with a maximum of 100 tokens by sentence.
The data used in this example has been collected by the research project ANR Geomedia and are free to use for scientific and pedagogical purpose only. The content of the news should not be used or disseminated without the agreement of the newspapers.
1.1 Selection of media
We import the data provided by each media and put them in a single data.frame. Then we select the columns of interest
# Load data with the function fread (fast) and the encoding UTF-8
df1<-fread("data/source/fr_FRA_figaro_int.csv", encoding = "UTF-8")
df1$media<-"fr_FRA_figaro"
df2<-fread("data/source/fr_FRA_libera_int.csv", encoding = "UTF-8")
df2$media<-"fr_FRA_libera"
df3<-fread("data/source/fr_BEL_derheu_int.csv", encoding = "UTF-8")
df3$media<-"fr_BEL_derheu"
df4<-fread("data/source/fr_BEL_lesoir_int.csv", encoding = "UTF-8")
df4$media<-"fr_BEL_lesoir"
df5<-fread("data/source/fr_DZA_elwata_int.csv", encoding = "UTF-8")
df5$media<-"fr_DZA_elwata"
df6<-fread("data/source/fr_DZA_xpress_int.csv", encoding = "UTF-8")
df6$media<-"fr_DZA_xpress"
# transform in data.table format
df<-rbind(df1,df2,df3,df4,df4,df5,df6)
rm(df1,df2,df3,df4,df5,df6)
# select column of interest
df$id <- df$ID_Item
df$who <- df$media
df$when <- df$Date_Recup
df$text <- paste(df$Titre," . ", df$Description, sep="")
df<-df[,c("id","who","when","text")]
df<-df[order(when),]
# select period of interest
mintime<-as.Date("2014-01-01")
maxtime<-as.Date("2015-12-31")
df<-df[(is.na(df$when)==F),]
df<-df[as.Date(df$when) >= mintime,]
df<-df[as.Date(df$when) <= maxtime,]
# eliminate duplicate
df<-df[duplicated(df$text)==F,]1.2 Check of time frequency
1.2.1 Time divisions
We transform the previous data.frame in a data.table format for easier operations of aggregation
dt<-as.data.table(df)
dt$day <- as.Date(dt$when)
dt$week <- cut(dt$day, "weeks", start.on.monday=TRUE)
dt$month <- cut(dt$day, "months")
dt$weekday <- weekdays(dt$day)
# Save data frame
saveRDS(dt,"data/corpus/dt_mycorpus.RDS") 1.2.2 News by week
We examine if the distribution is regular by week for the different media of the corpus.
dt<-readRDS("data/corpus/dt_mycorpus.RDS")
news_weeks<-dt[,.(newstot=.N),by=.(week,who)]
p<-ggplot(news_weeks, aes(x=as.Date(week),y=newstot, col=who))+
geom_line()+
geom_smooth(method = 'loess', formula = 'y~x')+
scale_y_continuous("Number of news", limits = c(0,NA)) +
scale_x_date("Week (starting on monday)") +
ggtitle(label ="Corpus : distribution of news by week",
subtitle = "1st Jan 2014 to 31th Dec. 2015")
p1.2.3 News by weekday
We examine if the distribution is regular by weekday and check in particular the effect of the week-end.
#compute frequencies by weekday
news_weekdays<-dt[,.(newstot=.N),by=.(weekday,who)]
news_weekdays<-news_weekdays[,.(weekday,newspct=100*newstot/sum(newstot)),by=.(who)]
# Translate weekdays in english and order
news_weekdays$weekday<-as.factor(news_weekdays$weekday)
levels(news_weekdays$weekday)<-c("7.Sunday","4.Wednesday","1.Monday","2.Tuesday","3.Thursday","6.Sathurday","5.Friday")
news_weekdays$wkd<-as.factor(as.character(news_weekdays$weekday))
news_weekdays<-news_weekdays[order(news_weekdays$weekday),]
p<-ggplot(news_weekdays, aes(x=weekday,fill = who, y=newspct))+
geom_bar(position = "dodge", stat="identity")+
scale_y_continuous("Share of news (%)", limits = c(0,NA)) +
ggtitle(label ="Corpus : distribution of news by week day",
subtitle = "1st Jan 2014 to 31th Dec. 2015")
p1.3 Transform in quanteda corpus
1.3.1 Reshape news by sentences
The aim of this step is to harmonize the length of texts collected through rss. We decide to keep only the title of news and the two first sentences of descriptions when they are available. The result is stored in quanteda format.
Unfortunately, the division in text is sentences realized by quanteda is far from perfect which is due to problems in the collection of news. For example, the following text will be considered as a single sentence because the point is not followed by a blank character.
Le conflit est terminé.Mais la Russie est-elle d’accord avec la Turquie.
It is necessary to add a regular expression for the cleaning of text and the inclusion of a blank space " " after each point located after a lower case character and before an upper case character :
str_replace_all(txt,“(?<=[:lower:])\.(?=[:upper:])”, “\.”)
In order to obtain a text that will be recognised as made of 2 sentences.
Le conflit est terminé. Mais la Russie est-elle d’accord avec la Turquie.
Some sentences appears too short or too long for a sound analysis. Therefore, we decide to eliminate outliers based on the quantile of the numbe of tokens. In practice we decide to eliminate the sentences with more than 100 tokens or less than 3 tokensr
t1<-Sys.time()
dt<-readRDS("data/corpus/dt_mycorpus.RDS")
# clean sentences break (long !)
dt$text<-str_replace_all(dt$text,"(?<=[:lower:])\\.(?=[:upper:])", "\\. ")
# transform in quanteda
qd<-corpus(dt,docid_field = "id",text_field = "text")
# break in sentences
qd<-corpus_reshape(qd,to="sentences", use_docvars=T)
# Identify rank of sentences
qd$order<-as.numeric(as.data.frame(str_split(names(qd),"\\.", simplify=T))[,2])
# Select only title + maximum of 3 sentences
qd<-corpus_subset(qd, order < 5)
# filter by number of tokens by sentence
qd$nbt<-ntoken(texts(qd))
#mintok<-quantile(qd$nbt,0.01)
#maxtok<-quantile(qd$nbt,0.99)
#qd<-corpus_subset(qd, nbt>mintok)
qd<-corpus_subset(qd, nbt<100)
qd<-corpus_subset(qd, nbt>2)
# Save corpus in qd format
saveRDS(qd,"data/corpus/qd_mycorpus.RDS")
t2<-Sys.time()
paste("Program executed in ", t2-t1)
head(qd)
summary(qd,3)1.3.2 Number of sentences by media
We check the number of sentences available by title (1) and order of sentences in description (2 to 5)
qd<-readRDS("data/corpus/qd_mycorpus.RDS")
x<-data.table(docvars(qd))
tab<-x[,.(tot=.N),by=.(who,order)]
tab<-dcast(tab,order~who)
tab$order<-as.factor(tab$order)
levels(tab$order)<-c("Title","Sent1","Sent2","Sent3")
kable(tab, caption = "Distribution of title and sentences by media")| order | fr_BEL_derheu | fr_BEL_lesoir | fr_DZA_elwata | fr_DZA_xpress | fr_FRA_figaro | fr_FRA_libera |
|---|---|---|---|---|---|---|
| Title | 6994 | 10815 | 2896 | 4794 | 9449 | 13703 |
| Sent1 | 6962 | 10730 | 2925 | 4755 | 9423 | 11227 |
| Sent2 | 1884 | 3517 | 2893 | 249 | 3591 | 3453 |
| Sent3 | 518 | 995 | 2867 | 16 | 536 | 444 |
1.3.3 Size of texts by month
We visualize the distribution of sentences of different order through time in order to prepare a decision on the length of text to be kept.
tab<-x[,.(tot=.N),by=.(month,order)]
tab$month<-as.Date(tab$month)
tab$order<-as.factor(tab$order)
levels(tab$order)<-c("Title","Sent1","Sent2","Sent3")
p<-ggplot(tab, aes(x=month,fill = order, y=tot))+
geom_bar(stat="identity")+
ggtitle(label ="Corpus : distribution of titles and sentences by month",
subtitle = "1st Jan 2014 to 31th Dec. 2015")
p4 Hypercubes creation
This section is based on the TELEMAC application elaborated during the H2020 projected ODYCCEUS and presented in the paper published in the journal Frontiers and available at https://analytics.huma-num.fr/Claude.Grasland/telemac/
Our objective is to elaborate an hypercube organised by different dimensions. As an example, we suppose that we are interested in the analysis of the crisis of migrant and refugees (what) in different newspapers (who), at different period of time (when) and we want to explore the locations of countries that are mentioned (where) and eventually associated together (where1.where2). Finally we want to distinguish inside the news the possible changes of results if we consider the title or the first, second and third sentences of the description (order).
4.1 Definition of dimensions
To illustrate this different options, we can look at the example of a news published by the Algerian newspaper El Watan the 16th September 2015 and divided in a title and three sentences of description.
qd<-readRDS("data/corpus/qd_mycorpus_states_topics.RDS")
examp<-corpus_subset(qd,docid(qd) == 9486265)
kable(paste(examp))| x |
|---|
| Crise des réfugiés en Europe : Vers un conseil des chefs d’Etat et de gouvernement de l’UE . |
| L’Allemagne, l’Autriche et la Slovaquie ont appelé, hier, à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire. |
| Après l’échec lundi de la réunion extraordinaire à Bruxelles des ministres de l’Intérieur de l’Union européenne (UE) sur la répartition des réfugiés par quotas, l’Allemagne, l’Autriche et la Slovaquie ont appelé hier à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire, rapporte l’AFP. |
| «C’est un problème pour l’Union européenne dans son ensemble, c’est pourquoi nous nous sommes prononcés pour la tenue la semaine prochaine d’un conseil extraordinaire de l’UE», a déclaré la chancelière allemande lors d’une conférence de presse avec son homologue autrichien Werner Faymann. |
Thanks to the previous operations of geographical and topical tagging, we can propose a simplified table where the text of the news has been removed and where we keep only the information of interest for the agregation procedure.
examp$id<-as.character(docid(examp))
dtexamp<-data.table(tidy(examp)) %>% select(id=id, order = order, who = who, when=day, what=mobil, where1 = states, where2=states)
kable(dtexamp)| id | order | who | when | what | where1 | where2 |
|---|---|---|---|---|---|---|
| 9486265 | 1 | fr_DZA_elwata | 2015-09-16 | refu | ||
| 9486265 | 2 | fr_DZA_elwata | 2015-09-16 | migr | DEU AUT SVK | DEU AUT SVK |
| 9486265 | 3 | fr_DZA_elwata | 2015-09-16 | refu migr | BEL DEU AUT SVK | BEL DEU AUT SVK |
| 9486265 | 4 | fr_DZA_elwata | 2015-09-16 | DEU AUT | DEU AUT |
The hypercube is the result of an aggregation of foreign news according several dimensions:
who : this dimension is related to the variable which describe the media outlets which published the RSS feeds. Each source is related to a code
ll_sss_xxxxxxwherellis the language,sssis the ISO3 code of the country andxxxxxxthe name of the media. For instance, a RSS feed produced by the Algerian newspaper El Watan is identified by the code who =fr_DZA_elwata. Starting from there, it is then possible to proceed to aggregation of the data by group of languages (eg. computation of the indicators for all the French speaking newspapers) or countries (compute the indicators for all the media outlets located in Algeria).when : this dimension describe the day when an article of the RSS feeds has been published, according to a reference time zone (Paris in present case). Starting from the day, the data will be further aggregated according to different period of aggregation: weeks, months, quarters or years. . For instance, by choosing to work on monthly aggregated data, the first period of observation for the news presented as example will be:
when = 2015-09-01. If we choose a division in weeks, we have to decide if the week start on Sunday (default option of R) or start on Monday (option adopted in present case)where1 and where2 : this dual dimension is associated to the cross-list of foreign countries detected by the country dictionary in the news. For example the second sentence of our exampple (“L’Allemagne, l’Autriche et la Slovaquie ont appelé, hier, à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire.”) has produced a list of three places (DEU,AUT,SVK) associated to the cross-list of nine couple of places (AUT-AUT, AUT-DEU, AUT-SVK, DEU-DEU, DEU-AUT, DEU-SVK, SVK-AUT, SVK-DEU, SVK-SVK) where each couple will receive a weight of 1/9. It is important to keep in mind that the countries where the media are located (mentioned in the
whodimension) should be excluded from the list if we decide to work only on foreign news.what : In general, this dimension can be described as a boolean value (TRUE/FALSE) which precise if the news is associated or not to the topic of interest. For example the title and the two first sentences of our example are associated to the topic of international mobility but not the third sentence where the expected keywords has not been found. But if we have introduced subtopics, the situation is more complex because the news can be associated to different subtopics (as it was associated to different states). For example the second sentence of the description (“Après l’échec lundi de la réunion extraordinaire à Bruxelles des ministres de l’Intérieur de l’Union européenne (UE) sur la répartition des réfugiés par quotas, l’Allemagne, l’Autriche et la Slovaquie ont appelé hier à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire, rapporte l’AFP”) is associated to 2 subtopics (refug, migr) and 4 countries (BEL, AUT, DEU, SVK). It will therefore be broken in 2 x 4 x4 = 32 pieces of information, each of them associated to a value of 1/16th.
order : To build the hypercube, it is possible to works on different size of text units: (
order=1): the title or the first sentence, or (order = 2,3,4, ...): the title with the selected number of sentence of the description available. This parameter is important because some results, especially regarding the spatial dimension of the analysis (where) are more noticeable on longer texts. In our example, it is clear that the conclusions would be different if we had decided to focus only on the title which does not mention any country and is only associated to the subtopic of refugees.
4.2 Aggregation function
The elaboration of the hypercube is based on the crossing of all dimensions with one line for each singular combination. To do that, we have elaborated a specific function that combine all the 6 dimensions but can be easily adapted if less dimensions are needed.
#' @title create an hypercube
#' @name hypercube
#' @description create a network of interlinked states
#' @param corpus a corpus of news in quanteda format
#' @param order an order of sentences in the news
#' @param who the source dimension
#' @param when the time dimension
#' @param timespan aggreation of time
#' @param what a list of topics
#' @param where1 a list of states
#' @param where2 a list of states
hypercube <- function( corpus = qd,
order = "order",
who = "source",
when = "when",
timespan = "week",
what = "what",
where1 = "where1",
where2 = "where2")
{
# prepare data
don<-docvars(corpus)
df<-data.table(id = docid(corpus),
order = don[[order]],
who = don[[who]],
when = don[[when]],
what = don[[what]],
where1 = don[[where1]],
where2 = don[[where2]])
# adjust id
df$id<-paste(df$id,"_",df$order,sep="")
# change time span
df$when<-as.character(cut(as.Date(df$when), timespan, start.on.monday = TRUE))
# unnest where1
df$where1[df$where1==""]<-"_no_"
df<-unnest_tokens(df,where1,where1,to_lower=F)
# unnest where2
df$where2[df$where2==""]<-"_no_"
df<-unnest_tokens(df,where2,where2,to_lower=F)
# unnest what
df$what[df$what==""]<-"_no_"
df<-unnest_tokens(df,what,what,to_lower=F)
# Compute weight of news
newswgt<-df[,list(wgt=1/.N),list(id)]
df <- merge(df,newswgt, by="id")
# ------------------------ Hypercube creation --------------------#
# Aggregate
hc<- df[,.(tags = .N, news=sum(wgt)) ,.(order,who, when,where1,where2, what)]
# Convert date to time
hc$when<-as.Date(hc$when)
# export
return(hc)
}In order to test the function, we apply it firstly on our small example of the single news published by El Watan
hc_test<-hypercube( corpus = examp,
order = "order",
who = "who",
when = "when",
timespan = "day",
what = "mobil",
where1 = "states",
where2 = "states")
kable(hc_test)| order | who | when | where1 | where2 | what | tags | news |
|---|---|---|---|---|---|---|---|
| 1 | fr_DZA_elwata | 2015-09-16 | no | no | refu | 1 | 1.0000000 |
| 2 | fr_DZA_elwata | 2015-09-16 | DEU | DEU | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | DEU | AUT | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | DEU | SVK | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | AUT | DEU | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | AUT | AUT | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | AUT | SVK | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | SVK | DEU | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | SVK | AUT | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | SVK | SVK | migr | 1 | 0.1111111 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | BEL | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | BEL | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | DEU | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | DEU | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | AUT | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | AUT | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | SVK | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | SVK | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | BEL | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | BEL | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | DEU | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | DEU | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | AUT | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | AUT | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | SVK | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | SVK | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | BEL | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | BEL | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | DEU | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | DEU | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | AUT | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | AUT | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | SVK | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | SVK | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | BEL | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | BEL | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | DEU | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | DEU | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | AUT | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | AUT | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | SVK | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | SVK | migr | 1 | 0.0312500 |
| 4 | fr_DZA_elwata | 2015-09-16 | DEU | DEU | no | 1 | 0.2500000 |
| 4 | fr_DZA_elwata | 2015-09-16 | DEU | AUT | no | 1 | 0.2500000 |
| 4 | fr_DZA_elwata | 2015-09-16 | AUT | DEU | no | 1 | 0.2500000 |
| 4 | fr_DZA_elwata | 2015-09-16 | AUT | AUT | no | 1 | 0.2500000 |
- order = 1 : the title is described by a single line because we have only one subtopic and no states mentioned. The weight of the line is 1.
- order = 2 : the first sentence of description is characterized by one subtopic and three different states which produce 9 lines with weight of 1/9 = 0.111 news.
- order = 3 : the second sentence of description is characterized by two subtopic and four diffeent states which produce 32 lines with weight of 1/32 = 0.031 news.
- order = 4 : the last sentence of description is characterized by no topics and 2 states which produce 4 lines with weight of O.25.
4.3 Application
Of course it is not interesting to transform a single news in such a large table. But it is of high interest if we realize the agregation on a large number of news. Because in this case the number of combination of dimensions is limited and we can obtain a synthetic table called hypercube that summarize all the information extracted from the news in a relatively small object. The time of computation of an hypercube can be relatively large and the memory size necessary to the intermediary step of disagregation can be important, but the resulting object is small and very adapted for a large number of exploration and modelisation methods.
In practice, the function based on data.table package appears to be very fast and it appears interesting to build an hypercube for each of the topics of interest.
4.3.1 Pandemic hypercube
hc<-hypercube( corpus = qd,
order = "order",
who = "who",
when = "when",
timespan = "day",
what = "pand",
where1 = "states",
where2 = "states")
saveRDS(hc,"data/corpus/hc_mycorpus_states_pand.RDS")
paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_pand.RDS")/1000000,3), "Mo")[1] "Size of resulting file = 0.384 Mo"
4.3.2 Quake hypercube
hc<-hypercube( corpus = qd,
order = "order",
who = "who",
when = "when",
timespan = "day",
what = "pand",
where1 = "states",
where2 = "states")
saveRDS(hc,"data/corpus/hc_mycorpus_states_quak.RDS")
paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_quak.RDS")/1000000,3), "Mo")[1] "Size of resulting file = 0.384 Mo"
4.3.3 Mobility hypercube
hc<-hypercube( corpus = qd,
order = "order",
who = "who",
when = "when",
timespan = "day",
what = "mobil",
where1 = "states",
where2 = "states")
saveRDS(hc,"data/corpus/hc_mycorpus_states_mobil.RDS")
paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_mobil.RDS")/1000000,3), "Mo")[1] "Size of resulting file = 0.39 Mo"
5 Hypercubes exploration
#source("pgm/hypernews_functions_V6.R")5.1 Objectives
The different dimensions of an hypercube can be analysed through different aggregation of the dimensions of the hypercubes, leading to different tables authorizing different modes of visualization. Each function is named according to the dimensions that are combined. Each function will produce two different outputs, a statistical table and an interactive graphic
5.1.1 Statistical table
Whatever the dimensions we decide to cross, we build a table where we realize a statistical test in order to identify the cells that are characterized by positive or negative outliers i.e. cells where the phenomena of interest (WHAT) is significantly more present or less present than usual. More precisely, the function will produce two for each cell of the cross dimension table :
- a salience index (Xobs/Xest) : defined as the ratio between observed and estimated number of news where the topic is present.
- an outlier index (prob (Xobs > Xest)) : defined as the probability that the number of news where the topic is present is significantly greater than expected.
In both cases we introduce two parameters of control that will limit the computation of indexes to the cells where it appears statistically relevant to realize the measure :
Minimum sample size (minsamp) : is the total number of news present in the cell before to compute the probability of apparition of the topic. The default value is equal to 20 as we consider as not meaningfull to compute a proportion on a smaller sample.
Minimum estimated value (mintest): is the threshold of computation of the chi-square test according to the estimated number of news where the topic is present. According to statistical rules of the chi-square test, this threshold should be equal to 5 for optimal conditions of application. The package R introduce indeed a warning message if the condition is not satisfied, which can increase the time of computation.
Of course, the user can decide to relax or reinforce these two conditions but it is normally better to avoid to do it. When conditions are not fulfilled, the graphic output will not display the cells where the indexes can not be computed.
The function that realize the test is the following one
#### ---------------- testchi2 ----------------
#' @title Compute the average salience of the topic and test significance of deviation
#' @name what
#' @description create a table and graphic of the topic
#' @param tabtest a table with variable trial, success and null.value
#' @param minsamp : Threshold of sample size requested for salience computation
#' @param mintest : Threshold of estimated value requested for chi-square test
testchi2<-function(tabtest=tabtest,
minsamp = 20,
mintest = 5)
{
tab<-tabtest
n<-dim(tab)[1]
# Compute salience if sample size sufficient (default : N>20)
tab$estimate <-NA
tab$salience <-NA
tab$chi2<-NA
tab$p.value<-NA
if (tab$trial > minsamp){ tab$estimate<-round(tab$success/tab$trial,5)
tab$salience<-tab$estimate/tab$null.value
# Chi-square test if estimated value sufficient (default : Nij* > 5)
for (i in 1:n) {
if(tab$trial[i]*tab$null.value[i]>=mintest) {
test<-prop.test(x=tab$success[i],n=tab$trial[i], p=tab$null.value[i],
alternative = "greater")
tab$chi2[i]<-round(test$statistic,2)
tab$p.value[i]<-round(test$p.value,5)
}
}
}
return(tab)
}5.1.2 Interactive graphic
Once the statistical table has been computed, the user can choose between two different visualizations, based on the salience index (exploration) or the chi-square test (ouliers detection). In both case the result will be an interactive figure realized in plotly where it is possible to click on each cell and have a look at the statistical parameters.
The user interested in static graphic (e.g. for publication) can easily adapt the program and realize new functions, for example in ggplot2.
In order to illustrate each type of graphic, we will choose the example of the topic of mobility without distinction between migrants and refugees.
hc <- readRDS("data/corpus/hc_mycorpus_states_mobil.RDS")5.2 Topic frequence (What ?)
The first function has only one dimension and evaluate the proportion of news related to the topic. As a consequence, this function is not associated to a statistical test and return only a table and a graphic presenting the proportion of news where the topic is present or not.
5.2.1 Function
### ---------------- what ----------------
#' @title Compute the average salience of the topic
#' @name what
#' @description create a table and graphic of the topic
#' @param hc an hypercube prepared as data.table
#' @param subtop a subtag of the main tag (default = NA)
#' @param title Title of the graphic
what <- function (hc = hypercube,
subtop = NA,
title = "What ?")
{
tab<-hc
if (is.na(subtop)){tab$what <-tab$what !="_no_"}else {tab$what <- tab$what == subtop}
tab<-tab[,list(news = sum(news)),by = what]
tab$pct<-100*tab$news/sum(tab$news)
p <- plot_ly(tab,
labels = ~what,
values = ~pct,
type = 'pie') %>%
layout(title = title,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
output<-list("table" = tab, "plotly" =p)
return(output)
}5.2.2 Example
res_what <- what(hc = hc,
subtop = NA,
title = "Topic news")
res_what$table what news pct
1: FALSE 112482 97.272476
2: TRUE 3154 2.727524
res_what$plotlyThe table indicate that 3154 news was associated to the topic which represent 2.72% of the total.
5.3 Topic variation by media (who.what)
The function who.what explore the variation of interest for the topic in the different media of the corpus.
5.3.1 Function
#### ---------------- who.what ----------------
#' @title visualize variation of the topic between media
#' @name who.what
#' @description create a table of variation of the topic by media
#' @param hc an hypercube prepared as data.table
#' @param test : visualize test (TRUE) or salience (FALSE)
#' @param minsamp : Threshold of sample size requested for salience computation
#' @param mintest sample size of estimate for chi-square test (default = 5)
#' @param title Title of the graphic
who.what <- function (hc = hypercube,
test = FALSE,
minsamp = 20,
mintest = 5,
title = "Who says What ?")
{
tab<-hc
{tab$what <-tab$what !="_no_"}
tab<-tab[,list(trial = sum(news),success=round(sum(news*what),0)),by = list(who)]
ref <-round(sum(tab$success)/sum(tab$trial),4)
tab$null.value<-ref
tab<-testchi2(tabtest=tab,
minsamp = minsamp,
mintest = mintest)
if (test==FALSE) {tab$index =tab$salience
tab<-tab[tab$trial > minsamp,]
mycol<-brewer.pal(7,"YlOrRd")
}
else {tab$index=tab$p.value
tab<-tab[tab$trial*tab$null.value>mintest,]
mycol<-brewer.pal(7,"RdYlBu")
mycol[4]<-"lightyellow"
}
p <- plot_ly(tab,
x = ~who,
y = ~estimate*100,
color= ~index,
colors= mycol,
hoverinfo = "text",
text = ~paste('Source: ',who,
'<br /> Total news : ', round(trial,0),
'<br /> Topic news : ', round(success,0),
'<br /> % observed : ', round(estimate*100,2),'%',
'<br /> % estimated : ', round(null.value*100,2),'%',
'<br /> Salience : ', round(salience,2),
'<br /> p.value : ', round(p.value,4)),
type = "bar") %>%
layout(title = title,
yaxis = list(title = "% news"),
barmode = 'stack')
output<-list("table" = tab, "plotly" =p)
return(output)
}5.3.2 Example
We present here the statistical table and the two types of graphics that can be produced. In the following case we will only present the outlier graphic.
res_who_what<- who.what(hc=hc,
test = FALSE,
minsamp = 20,
mintest = 5,
title = "Topic news by media - Salience")
kable(res_who_what$table)| who | trial | success | null.value | estimate | salience | chi2 | p.value | index |
|---|---|---|---|---|---|---|---|---|
| fr_FRA_libera | 28827 | 607 | 0.0273 | 0.02106 | 0.7714286 | 42.08 | 1.00000 | 0.7714286 |
| fr_DZA_elwata | 11581 | 292 | 0.0273 | 0.02521 | 0.9234432 | 1.82 | 0.91137 | 0.9234432 |
| fr_BEL_derheu | 16358 | 349 | 0.0273 | 0.02134 | 0.7816850 | 21.69 | 1.00000 | 0.7816850 |
| fr_FRA_figaro | 22999 | 1005 | 0.0273 | 0.04370 | 1.6007326 | 232.26 | 0.00000 | 1.6007326 |
| fr_BEL_lesoir | 26057 | 652 | 0.0273 | 0.02502 | 0.9164835 | 5.01 | 0.98737 | 0.9164835 |
| fr_DZA_xpress | 9814 | 249 | 0.0273 | 0.02537 | 0.9293040 | 1.30 | 0.87310 | 0.9293040 |
res_who_what$plotly res_who_what<- who.what(hc=hc,
test = TRUE,
minsamp = 5,
mintest = 1,
title = "Topic news by media - Significance")
res_who_what$plotlyThe analysis reveal a clear over-representation of the topic in the french newspaper Le Figaro (4.37% of news) as compared to the other media (2.1 to 2.5%).
5.4 Topic variation through time (when.what)
In this case we want to analyze if the topic has been more or less present at one period of time or another. It can therefore be interesting to modify the level of agregation before to do that and transform the initial hypercube (by day) toward another level of agregation. It is also possible to change the size of the time period as the outlier are defined by reference to the whole period of analysis
5.4.1 Function
#### ---------------- when.what ----------------
#' @title visualize variation of the topic through time
#' @name when.what
#' @description create a table of variation of the topic by media
#' @param test : visualize test (TRUE) or salience (FALSE)
#' @param minsamp : Threshold of sample size requested for salience computation
#' @param mintest sample size of estimate for chi-square test (default = 5)
#' @param title Title of the graphic
when.what <- function (hc = hypercube,
test = FALSE,
minsamp = 20,
mintest = 5,
title = "Who says What ?")
{
tab<-hc
{tab$what <-tab$what !="_no_"}
tab<-tab[,list(trial = sum(news),success=round(sum(news*what),0)),by = list(when)]
ref <-round(sum(tab$success)/sum(tab$trial),4)
tab$null.value<-ref
tab<-testchi2(tabtest=tab,
minsamp = minsamp,
mintest = mintest)
if (test==FALSE) {tab$index =tab$salience
tab<-tab[tab$trial > minsamp,]
mycol<-brewer.pal(7,"YlOrRd")
}
else {tab$index=tab$p.value
tab<-tab[tab$trial*tab$null.value>mintest,]
mycol<-brewer.pal(7,"RdYlBu")
mycol[4]<-"lightyellow"
}
p <- plot_ly(tab,
x = ~as.character(when),
y = ~estimate*100,
color= ~index,
colors= mycol,
hoverinfo = "text",
text = ~paste('Time: ',when,
'<br /> Total news : ', round(trial,0),
'<br /> Topic news : ', round(success,0),
'<br /> % observed : ', round(estimate*100,2),'%',
'<br /> % estimated : ', round(null.value*100,2),'%',
'<br /> Salience : ', round(salience,2),
'<br /> p.value : ', round(p.value,4)),
type = "bar") %>%
layout(title = title,
yaxis = list(title = "% news"),
barmode = 'stack')
output<-list("table" = tab, "plotly" =p)
return(output)
}5.4.2 Example 1 : 2014-2015 by month
# Modify time period by month
hc2 <- hc %>% mutate(when = cut(when,breaks="month"))
res_when_what<- when.what(hc=hc2,
test=TRUE,
minsamp=10,
mintest=5,
title = "Topic news by month - Significance")
res_when_what$plotlyThe analysis reveals clear discontinuities in the timeline of the topic. We start with a low level (0.5 to 1.2%) from January 2014 to March 2015, followed by a brutal jump in April-June 2015 (3 to 5%) and a major peak in september 2015 (15.8% of news). At the end of the period, the level is clearly higher than at the beginning.
5.4.3 Example 2 : 2015 by week
# Modify time period by month
hc2 <- hc %>% filter(substr(when,1,4)=="2015") %>% mutate(when = cut(when,breaks="week"))
res_when_what<- when.what(hc=hc2,
test=TRUE,
minsamp=10,
mintest=5,
title = "Topic news by week - Significance")
res_when_what$plotly5.4.4 Example 3 : August-Nov 2015 by day
# Modify time period by month
hc2 <- hc %>% filter(when > as.Date("2015-07-31"), when < as.Date("2015-11-01"))
res_when_what<- when.what(hc=hc2,
test=TRUE,
minsamp=10,
mintest=5,
title = "Topic news by day - Significance")
res_when_what$plotly5.5 Topic variation through space (where.what)
This function analyze the spatial distribution of places associated to the topic. As we have only collected states, we do not take into account the news where the topic of interest is associated to geographical area different from states (e.g. “migrants from subsaharan Africa”). But it is only a minority of cases and the fact to collect states make possible to produce easily a geographical map of the phenomena.
5.5.1 Function
#### ---------------- where.what ----------------
#' @title visualize spatialization of the topic
#' @name where.what
#' @description create a table of variation of the topic by media
#' @param hc an hypercube prepared as data.table
#' @param test : visualize test (TRUE) or salience (FALSE)
#' @param minsamp : Threshold of sample size requested for salience computation
#' @param mintest sample size of estimate for chi-square test (default = 5)
#' @param map a map with coordinates in lat-long
#' @param proj a projection accepted by plotly
#' @param title Title of the graphic
where.what <- function (hc = hypercube,
test = FALSE,
minsamp = 20,
mintest = 5,
map = world_ctr,
proj = 'azimuthal equal area',
title = "Where said What ?")
{
tab<-hc
tab$what <-tab$what !="_no_"
tab<-tab[,list(trial = round(sum(news),0),success=round(sum(news*what),0)),by = list(where1)]
ref <-round(sum(tab$success)/sum(tab$trial),4)
tab$null.value<-ref
tab<-testchi2(tabtest=tab,
minsamp = minsamp,
mintest = mintest)
tab<-tab[order(-chi2),]
if (test==FALSE) {tab$index =tab$salience
tab<-tab[tab$trial > minsamp,]
mycol<-brewer.pal(7,"YlOrRd")
}
else {tab$index=tab$p.value
tab<-tab[tab$trial*tab$null.value>mintest,]
mycol<-brewer.pal(7,"RdYlBu")
mycol[4]<-"lightyellow"
}
map<-merge(map,tab,all.x=T,all.y=F,by.x="ISO3",by.y="where1")
#map2<-map[is.na(map$pct)==F,]
#map2<-st_centroid(map2)
#map2<-st_drop_geometry(map2)
g <- list(showframe = TRUE,
framecolor= toRGB("gray20"),
coastlinecolor = toRGB("gray20"),
showland = TRUE,
landcolor = toRGB("gray50"),
showcountries = TRUE,
countrycolor = toRGB("white"),
countrywidth = 0.2,
projection = list(type = proj))
p<- plot_geo(map)%>%
add_markers(x = ~lon,
y = ~lat,
sizes = c(0, 250),
size = ~success,
# color= ~signif,
color = ~index,
colors= mycol,
hoverinfo = "text",
text = ~paste('Location: ',NAME,
'<br /> Total news : ', round(trial,0),
'<br /> Topic news : ', round(success,0),
'<br /> % observed : ', round(estimate*100,2),'%',
'<br /> % estimated : ', round(null.value*100,2),'%',
'<br /> Salience : ', round(salience,2),
'<br /> p.value : ', round(p.value,4))) %>%
layout(geo = g,
title = title)
output<-list("table" = tab, "plotly" =p)
return(output)
}5.5.2 Example
When we realize the map, we eliminate the news related to the topic where no countries has been mentioned. As a consequence the reference value is modified : in the whole sample 2.73% of news was related to the topic but in the sample of news where one country is mentioned 2.83% of the news are related the topic.
As the total number of news can be small in some countries, we have reduced here the parameters of the statistical test in order to visualize more countries on the map. It is therefore necessary to be cautious in the analysis of results.
map<-readRDS("data/dico_states/world_ctr_4326.Rdata")
hc2<-hc %>% filter(where1 !="_no_", where2 !="_no_")
res_where_what<- where.what(hc=hc2,
test=TRUE,
minsamp=10,
map = map,
mintest =2,
title = "Topic news by states - Significance")
res_where_what$plotlyThe analysis reveals that some countries are “specialized” in the topic during the period of observation. For example 53.5% of the news about Hungary was associated to the question of migrants and refugees, which is obviously related to the mediatization of the wall established by Viktor Orban in 2015. Other countries are characterized on the contrary by an under-representation of the topic like the USA where the topic is only associated to 0.7% of news. But the situation will change after Donald Trump’s election who will also establish a wall which will dramatically increase the number of news about USA and migrants.
5.6 Crossing 3 dimensions ?
It is of course possible to cross more than two dimensions but in this case it is necessary to have a larger sample of news because statistical tests are not possible when the numberof news is too small in the cells. The application TELEMAC, realized during the H2020 project OYCCEUS provides example of analysis combining who.where.what, who.when.what or where.when.what. It is based on a shiny interface with a lot of possibilities of interaction with the data that has not been developed here.
6 Spatial Networks
6.1 Introduction
This final section will focus on the spatial dimension of the analysis and more precisely on the spatial linkages that can be revealed by the news when two countries or more are present in the same sentence. Before to analyze real data, it is important to start with a very small theoretical example in order to explain more clearly what are these spatial linkages.
Consider the following set of 8 news published in 8 languages :
text1 <- "Russia annexes Crimea against Ukraine's advice"
text2 <- "La France, l'Allemagne et les USA protestent contre l'intervention de l'armée russe en Ukraine"
text3 <- "Russlands Intervention in Syrien wird von China unterstützt"
text4 <- "Gli Stati Uniti si rifiutano di intervenire in Siria contro il parere della Francia "
text5 <- "Trump está construyendo un muro entre Estados Unidos y México "
text6 <- "Bangladeş, Myanmar'dan daha fazla mülteci kabul etmeyi reddediyor"
text7 <- "פלסטינים ירו רקטות איראניות לעבר ישראל "
text8 <-"ضحايا الفلسطينيون للهجوم الإسرائيلي المدعوم من الولايات المتحدة "
text<-c(text1,text2,text3,text4,text5, text6,text7, text8)
kable(text)| x |
|---|
| Russia annexes Crimea against Ukraine’s advice |
| La France, l’Allemagne et les USA protestent contre l’intervention de l’armée russe en Ukraine |
| Russlands Intervention in Syrien wird von China unterstützt |
| Gli Stati Uniti si rifiutano di intervenire in Siria contro il parere della Francia |
| Trump está construyendo un muro entre Estados Unidos y México |
| Bangladeş, Myanmar’dan daha fazla mülteci kabul etmeyi reddediyor |
| פלסטינים ירו רקטות איראניות לעבר ישראל |
| ضحايا الفلسطينيون للهجوم الإسرائيلي المدعوم من الولايات المتحدة |
These news has been written in five different languages, but if we apply a method of geocoding where we try just to recognize the presence of states mentionned, we can produce three new sentences using the same language base on on ISO3 code of countries.
text1 <- "RUS UKR"
text2 <- "FRA DEU USA RUS UKR"
text3 <- "RUS SYR CHN"
text4 <- "USA SYR FRA "
text5 <- "USA MEX"
text6 <- "BGD MMR"
text7 <- "PSE ISR IRN"
text8 <- "PSE ISR USA"
text<-c(text1,text2,text3,text4,text5, text6, text7,text8)
kable(text)| x |
|---|
| RUS UKR |
| FRA DEU USA RUS UKR |
| RUS SYR CHN |
| USA SYR FRA |
| USA MEX |
| BGD MMR |
| PSE ISR IRN |
| PSE ISR USA |
Now imagine that we put this 8 sentences in quanteda (or any text mining package) as a corpus :
qd<-corpus(text)
summary(qd)Corpus consisting of 8 documents, showing 8 documents:
Text Types Tokens Sentences
text1 2 2 1
text2 5 5 1
text3 3 3 1
text4 3 3 1
text5 2 2 1
text6 2 2 1
text7 3 3 1
text8 3 3 1
We can see that our ISO3 code has been identified as tokens which can be analyze as the element of a new language which is universal as long as we consider our dictionaries of state identification as correct.
It is now possible to use powerful tool of textual analysis which is the function fcm (feature co-occurence matrix) which compute all cases of co-occurence of two tokens in the same sentence.
geo_fcm<-fcm(qd, tri=FALSE)
kable(as.matrix(geo_fcm))| RUS | UKR | FRA | DEU | USA | SYR | CHN | MEX | BGD | MMR | PSE | ISR | IRN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RUS | 0 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| UKR | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FRA | 1 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DEU | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| USA | 1 | 1 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| SYR | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| CHN | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MEX | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BGD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| MMR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| PSE | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
| ISR | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 |
| IRN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
We obtain a matrix which is in fact equivalent to a weighted network of co-coccurence of countries. A quick and dirty visualization of these object with a function of the package quanteda.textplots :
textplot_network(geo_fcm)This example is sufficient to demonstrate the interest of geo network for the analysis of press, especially if we adopt a cross language perspective. The reader familiar with network analysis will easily find the tools adapted for the visualization and analysis of networks derived from the fcm matrix.
6.2 Geo networks modelisation
We propose here an example of modelisation of geo network that is specifically adapted to the hypercube structure that we have developed previously. What we propose here is not to identify the geographical units that are the most related but the geographical units that are more related than expected according to a random model. We have elaborated for that a pipeline of functions
6.2.1 Hypercube Filter (function)
This function is dedicated to the selection of some parts of an hypercube
hc_filter <- function(don = hc,
who = "who",
when = "when",
where1 = "where1",
where2 = "where2",
wgt = "tags",
self = FALSE,
when_start = NA,
when_end = NA,
who_exc = NA,
who_inc = NA,
where1_exc = NA,
where1_inc = NA,
where2_exc = NA,
where2_inc = NA)
{
df<-data.table(who = don[[who]],
when = don[[when]],
where1 = don[[where1]],
where2 = don[[where2]],
wgt = don[[wgt]])
# Select time period
if (is.na(when_start)==FALSE) {
df <- df[when >= as.Date(when_start), ]}
if (is.na(when_end)==FALSE) {
df <- df[when <= as.Date(when_end), ]}
# Select who
if (is.na(who_exc)==FALSE) {
df <- df[!(who %in% who_exc), ]}
if (is.na(who_inc)==FALSE) {
df <- df[(who %in% who_inc), ]}
# Select where1
if (is.na(where1_exc)==FALSE) {
df <- df[!(where1 %in% where1_exc), ]}
if (is.na(where1_inc)==FALSE) {
df <- df[(where1 %in% where1_inc), ]}
# Select where2
if (is.na(where2_exc)==FALSE) {
df <- df[!(where2 %in% where2_exc), ]}
if (is.na(where2_inc)==FALSE) {
df <- df[(where2 %in% where2_inc), ]}
# eliminate internal links
if (self==FALSE) {
df <- df[(where1 != where2), ]}
return(df)
}6.2.2 Matrix builder (function)
This function build an interaction matrix (i, j, Fij) between states mentioned in the news. But offers the possibility to eliminate the spatial units that are not sufficiently present according to 5 parameters :
- s1, s2 : minimum sum of links in lines and columns
- n1,n2,k : minimum number of links of size greater than k in lines and columns
For example, the default parameters select only the units with a total sum of 10 and a minimum of 2 links greater than 1
build_int <- function(don = don, # a dataframe with columns i, j , Fij
i = "where1",
j = "where2",
Fij = "wgt",
s1 = 10,
s2 = 10,
n1 = 2,
n2 = 2,
k = 1)
{
df<-data.table(i=don[[i]],j=don[[j]],Fij=don[[Fij]])
int <-df[,.(Fij=sum(Fij)),.(i,j)]
int<-dcast(int,formula = i~j,fill = 0)
mat<-as.matrix(int[,-1])
row.names(mat)<-int$i
mat<-mat[apply(mat,1,sum)>=s1,apply(mat,2,sum)>=s2 ]
m0<-mat
m0[m0<k]<-0
m0[m0>=k]<-1
mat<-mat[apply(m0,1,sum)>=n1,apply(m0,2,sum)>=n2 ]
int<-reshape2::melt(mat)
names(int) <-c("i","j","Fij")
return(int)
}6.2.3 Random model (function)
This function solve the random model of allocation of flows (based on marginal sum) and provide as results :
- Fij : observed flow
- Eij : expected flow
- Rabs_ij : absolute residual (Fij-Eij)
- Rrel_ij : relative residual (Fij/Eij)
- Rchi_ij : chi-square of the difference between Fij and Eij
It is not recommended to apply this function on a too large table and a parameter maxsize is introduced in order to lmimiit the risk of crash of the program.
rand_int <- function(int = int, # A table with columns i, j Fij
maxsize = 100000,
diag = FALSE,
resid = FALSE) {
# Eliminate diagonal ?
if (diag==FALSE) {
int <- int[as.character(int$i) != as.character(int$j), ]}
# Compute model if size not too large
if (dim(int)[1] < maxsize) {
# Proceed to poisson regression model
mod <- glm( formula = Fij ~ i + j,family = "poisson", data = int)
# Add residuals if requested
if(resid == TRUE) {
# Add estimates
int$Eij <- mod$fitted.values
# Add absolute residuals
int$Rabs_ij <- int$Fij-int$Eij
# Add relative residuals
int$Rrel_ij <- int$Fij/int$Eij
# Add chi-square residuals
int$Rchi_ij <- (int$Rabs_ij)**2 / int$Eij
int$Rchi_ij[int$Rabs_ij<0]<- -int$Rchi_ij[int$Rabs_ij<0]
}
} else { paste ("Table > 100000 - \n
modify maxsize = parameter \n
if you are sure that your computer can do it !")}
# Export results
int$i<-as.character(int$i)
int$j<-as.character(int$j)
return(int)
}6.2.4 Visualize network (function)
This standard procedure is dedicated to the visualization of networls with a pleasant javascript interface. We do not explain here in details all parameters but notice that
- size and minsize : defines the width of edges between nodes
- test and mintest : defines the criteria of selection of edges. It will be generally the criteria of chi-square > 3.84 if we want to visualize non random associations with p < 0.05.
geo_network<- function(don = don,
from = "i",
to = "j",
size = "Fij",
minsize = 1,
maxsize = NA,
test = "Fij",
mintest = 1,
loops = FALSE,
title = "Network")
{
int<-data.frame(i = as.character(don[,from]),
j = as.character(don[,to]),
size = don[,size],
test = don[,test]
)
if (is.na(minsize)==FALSE) {int =int[int$size >= minsize,]}
if (is.na(maxsize)==FALSE) {int =int[int$size <= maxsize,]}
if (is.na(mintest)==FALSE) {int =int[int$test >= mintest,]}
nodes<-data.frame(code = unique(c(int$i,int$j)))
nodes$code<-as.character(nodes$code)
nodes$id<-1:length(nodes$code)
nodes$label<-nodes$code
nodes$color <-"gray"
nodes$color[nodes$code %in% int$j]<-"red"
# Adjust edge codes
edges <- int %>% mutate(width = 5+5*size / max(size)) %>%
left_join(nodes %>% select(i=code, from = id)) %>%
left_join(nodes %>% select(j=code, to = id ))
# compute nodesize
toti<-int %>% group_by(i) %>% summarize(size =sum(size)) %>% select (code=i,size)
totj<-int %>% group_by(j) %>% summarize(size =sum(size)) %>% select (code=j,size)
tot<-rbind(toti,totj)
tot<-unique(tot)
tot$code<-as.factor(tot$code)
nodes <- left_join(nodes,tot) %>% mutate(value = 1 +5*sqrt(size/max(size)))
#sel_nodes <-nodes %>% filter(code %in% unique(c(sel_edges$i,sel_edges$j)))
# eliminate loops
if(loops == FALSE) {edges <- edges[edges$from < edges$to,]}
net<- visNetwork(nodes,
edges,
main = title,
height = "1000px",
width = "70%") %>%
visNodes(scaling =list(min =20, max=60,
label=list(min=20,max=80,
maxVisible = 20)))%>%
visEdges(scaling = list(min=20,max=60))%>%
visOptions(highlightNearest = TRUE,
# selectedBy = "group",
# manipulation = TRUE,
nodesIdSelection = TRUE) %>%
visInteraction(navigationButtons = TRUE) %>%
visLegend() %>%
visIgraphLayout(layout ="layout.fruchterman.reingold",smooth = TRUE)
net
return(net)
} 6.3 Application
We apply our different functions to the hypercube of mobility. For a more pleasant visualization, we replace firstly the ISO3 codes by the full name of the countries.
# Load complete hypercube
hc <- readRDS("data/corpus/hc_mycorpus_states_mobil.RDS")
hc <- hc %>% filter(where1 !="_no_", where2 != "_no_")
# Eliminate non foreign news (french news for french newspapers ...)
hc<-hc[hc$where1 != substr(who,4,6),]
hc<-hc[hc$where2 != substr(who,4,6),]
# Add complete labels
map<-readRDS("data/dico_states/world_map_4326.Rdata")
labs<-st_drop_geometry(map)
labs<-labs[,c(1,4)]
# Shorten the name of USA
labs$NAME[labs$ISO3=="USA"]<-"U.S.A."
names(labs)<-c("where1","geofr1")
hc<-left_join(hc,labs)
names(labs)<-c("where2","geofr2")
hc<-left_join(hc,labs)
hc_geo_geo <- hc6.3.1 Reference network
We can firstly realize a network that combine all periods of time (when), all media (who) and all topics (what), which provide a reference for the next analysis.
hc<-hc_geo_geo
hc<-hc_filter(don = hc,
wgt = "news",
where1 = "geofr1",
where2 = "geofr2",
where1_exc = c("_no_"),
where2_exc = c("_no_"),
self = FALSE
)
int <- build_int(don = hc,
s1=2,
s2=2,
n1=2,
n2=2,
k=0)
mod<-rand_int(int,
resid = TRUE,
diag = FALSE)
network<- geo_network(mod,
size = "Fij",
minsize = 10,
test = "Rchi_ij",
mintest = 3.84)
network- Comment : The network is a combination of structural proximity (countries that are linked by relation not related to the period of observation) and event linkage related to specific events or crisis that has produced temporary connections generally not observed. The preferential linkage between Israël and Palestine is typically a structural linkage but the association between Liberia, Guinea and Sierra Leone has been induced by the Ebola outbreak. The strong linkage between Ukraine and Russia is a combination of the two types of effects.
6.3.2 Focus on a media
We can isolate a specific media like El Watan in order to examine to what extent his network is different fromthe reference network. But the parameters of selections has to be adapted because the size of the sample is smaller.
hc<-hc_geo_geo %>% filter(who == "fr_DZA_elwata")
hc<-hc_filter(don = hc,
wgt = "news",
where1 = "geofr1",
where2 = "geofr2",
where1_exc = c("_no_"),
where2_exc = c("_no_"),
self = FALSE
)
int <- build_int(don = hc,
s1=1,
s2=1,
n1=1,
n2=1,
k=0)
mod<-rand_int(int,
resid = TRUE,
diag = FALSE)
network<- geo_network(mod,
size = "Fij",
minsize = 2,
test = "Rchi_ij",
mintest = 3.84)
network- Comments : The network of preferential associations of countries from El Watan is relatively fragmented because of the small number of news where two countries are mentionned. But it reveals nevertheless interesting associations anda clear focus on the Mediterranean countries. The sub-network with Morocco, Spain and Western Sahara is a specific concern of Algeria related to a long term conflict. The focus on the Israelo-Palestinian conflict is also very visible. The so-called “Migrant crisis” is not so visible as in French or Belgium newspapers.
6.3.3 Focus on a topic
We can also propose a focus on the countries that has been specifically associated in relation with a topic. For example, the topic of migrants and refugees. We are oblige to reduce one more time the different thresholds as the number of news is very limited, but the result is interesting.
hc<-hc_geo_geo %>% filter(what != "_no_")
hc<-hc_filter(don = hc,
wgt = "news",
where1 = "geofr1",
where2 = "geofr2",
where1_exc = c("_no_"),
where2_exc = c("_no_"),
self = FALSE
)
int <- build_int(don = hc,
s1=1,
s2=1,
n1=0,
n2=0,
k=0)
mod<-rand_int(int,
resid = TRUE,
diag = FALSE)
network<- geo_network(mod,
size = "Fij",
minsize = 1,
test = "Rchi_ij",
mintest = 3.84)
network- Comments : The different components of the graph are related to different location of crisis. The major components are related to the “Syrian crisis” and the “Balkan Crisis” but we can also notice interesting dyads like Italy-Libya (boats sinkings in the straits of Sicile), France-UK (Calais), Myanmar-Bangladesh (Rohyngias), etc. The network is a fair mirror of the different locations or groups of locations where migrant and refugees was described as crossing border by the press.
Bibliographie
Annexes
Infos session
| setting | value |
|---|---|
| version | R version 4.0.2 (2020-06-22) |
| os | macOS Catalina 10.15.7 |
| system | x86_64, darwin17.0 |
| ui | X11 |
| language | (EN) |
| collate | fr_FR.UTF-8 |
| ctype | fr_FR.UTF-8 |
| tz | Europe/Paris |
| date | 2021-12-04 |
| package | ondiskversion | source |
|---|---|---|
| data.table | 1.13.0 | CRAN (R 4.0.2) |
| dplyr | 1.0.2 | CRAN (R 4.0.2) |
| ggplot2 | 3.3.3 | CRAN (R 4.0.2) |
| knitr | 1.34 | CRAN (R 4.0.2) |
| lubridate | 1.7.9.2 | CRAN (R 4.0.2) |
| plotly | 4.9.2.2 | CRAN (R 4.0.2) |
| quanteda | 3.0.0 | CRAN (R 4.0.2) |
| quanteda.textplots | 0.94 | CRAN (R 4.0.2) |
| RColorBrewer | 1.1.2 | CRAN (R 4.0.2) |
| readr | 1.4.0 | CRAN (R 4.0.2) |
| readtext | 0.80 | CRAN (R 4.0.2) |
| rmarkdown | 2.11 | CRAN (R 4.0.2) |
| rzine | 0.1.0 | gitlab (rzine/package@a94bf55) |
| sf | 0.9.8 | CRAN (R 4.0.2) |
| stringr | 1.4.0 | CRAN (R 4.0.2) |
| tidytext | 0.2.6 | CRAN (R 4.0.2) |
| visNetwork | 2.0.9 | CRAN (R 4.0.2) |
Citation
@Manual{ficheRzine,
title = {Titre de la fiche},
author = {{Auteur.e.s}},
organization = {Rzine},
year = {202x},
url = {http://rzine.fr/},
}